Fast Triangle Counting through Wedge Sampling 



C. Seshadhri, Ali Pinar, and Tamara G. Kolda

Sandia National Laboratories 
Livermore, California 94551 

{scomand,apinar,tgkolda}@sandia.gov 



ABSTRACT 

Graphs and networks are used to model interactions in a va- 
riety of contexts, and there is a growing need to be able to 
quickly assess the qualities of a graph in order to understand 
its underlying structure. Some of the most useful metrics are 
triangle based and give a measure of the connectedness of 
"friends of friends." Counting the number of triangles in a 
graph has, therefore, received considerable attention in re- 
cent years. We propose new sampling-based methods for 
counting the number of triangles or the number of trian- 
gles with vertices of specified degree in an undirected graph 
and for counting the number of each type of directed tri- 
angle in a directed graph. The number of samples depends 
only on the desired relative accuracy and not on the size 
of the graph. We present extensive numerical results show- 
ing that our methods are often much better than the error 
bounds would suggest. In the undirected case, our method
is generally superior to other approximation approaches; in
the directed case, ours is the first approximation method
proposed.

Keywords 

triangle counting, directed triangle counting, clustering co- 
efficient, Hoeffding's inequality 

1. INTRODUCTION 

Over the last decade, graphs and networks have emerged
as the standard for modeling interactions between entities
in a wide variety of applications. Graphs are used to model
infrastructure networks, the world wide web, computer traffic,
molecular interactions, ecological systems, epidemics,
co-authors, citations, and social interactions, among others.
Despite the differences in the motivating applications, some
topological structures have emerged to be important across
all these domains. The most prevalent, and arguably the
most important, of these topological structures is the triangle
(3-clique). Many networks, especially social networks,
are known to have an abundance of triangles, which can be
explained by homophily (people become friends with those
similar to themselves) and transitivity (friends of friends
become friends). This abundance of triangles, along with the
information they reveal, motivates metrics such as the clustering
coefficient and the transitivity ratio [28].

*This work was funded by the Applied Mathematics Program
at the U.S. Department of Energy. Sandia National
Laboratories is a multi-program laboratory managed and
operated by Sandia Corporation, a wholly owned subsidiary
of Lockheed Martin Corporation, for the U.S. Department
of Energy's National Nuclear Security Administration under
contract DE-AC04-94AL85000.

We show that the total number of triangles, $t$, can be estimated
by sampling a fixed number of wedges and checking
if they are closed. A wedge is simply a length-2 path, and a
triangle is a length-3 cycle. We let $p$ be the total number of
wedges. We can create an estimate $\hat{t}$ such that

$$\Pr\{ |\hat{t} - t| \geq \epsilon p/3 \} \leq \delta.$$

In Tab. 1, we show how many samples are needed to estimate
the total number of triangles, $t$, within an accuracy
of $\epsilon p/3$, i.e., a fraction $\epsilon$ of one third of the total
number of wedges, at 99.9% confidence ($\delta = 0.001$). The size
of the sample is independent of the size of the graph, although
each sample requires the expense of checking the existence of
an edge. Not only is our proposed method extremely efficient,
but it also has easy-to-compute error bounds.



Accuracy ($\epsilon$)   0.10    0.05     0.01      0.005      0.001
Samples                 380     1,520    38,005    152,018    3,800,451

Table 1: Number of sampled wedges required for various
accuracies at 99.9% confidence.



Our contributions enable fast computation of triangles 
and related metrics in both undirected and directed graphs. 
Specifically, we present 

• a new sampling-based approach for undirected graphs 
for estimating the number of triangles and the 
clustering coefficient; 

• a new sampling-based approach for undirected graphs
for quickly estimating the number of triangles
having at least one node of degree d (or, more
generally, at least one node in a set D), as well as the
degree-wise clustering coefficients;

• a new sampling-based approach for directed graphs for
estimating counts of directed triangles;

• precise error bounds based on known quantities
for all of the above estimates; and

• extensive numerical results confirming the accuracy
of our method and the bounds as well as comparisons
to other approaches.

We show that our sampling-based approach for counting
triangles is more accurate than, and at least as fast as, competing
approximation approaches. To the best of our knowledge, ours
is the first approximation approach in the regime of directed
graphs.

Given an estimate of the number of triangles for directed
or undirected graphs, we can compute metrics that are of use
in a variety of contexts. The clustering coefficient measures
how tightly the neighbors of a vertex are connected amongst
themselves. At the global level, this property is an indicator
of how tightly the communities of the graph are connected
and may help to predict the behavior of individuals in the
network. For instance, Coleman [11] and Portes [20] use the
clustering coefficient to predict the likelihood of going against
social norms. Burt, on the other hand, underlined the importance
of nodes that can serve as a bridge between various
communities [7] and tied this observation to the number
of open triangles at a vertex [8]. Welles et al. studied the
variance of clustering coefficients for different demographic
groups and found that adolescents are more likely to have
connected friends than adults and are even more likely to
terminate connections with friends that are not connected
to their other friends [15].

Triangles have also been used in graph mining applications
such as spam detection [3]. Eckmann and Moses [14] interpreted
the clustering coefficients as a curvature and showed
that connected regions of high curvature on the WWW
characterized common topics.

Directed triangles are important motifs for comparing and
characterizing graphs [18, 19, 12, 21, 5]. For graph databases,
exploiting frequent patterns has also been proposed for
efficient query processing [24, 29].

In our earlier work, we used the distribution of degree-wise
clustering coefficients as the driving force for a new
generative model, Blocked Two-Level Erdős-Rényi [23]. In
that work, we looked not only at the clustering coefficient
but also at how the clustering coefficients relate to the
degree distribution, which motivates our algorithms in §4.

1.1 Sketch of Results 

We present an extremely efficient sampling technique for
estimating the number of triangles and the clustering coefficient.
Recall that the clustering coefficient of an undirected graph
$G = (V, E)$ is given by

$$c = \frac{3t}{p} = \frac{3 \times \text{total number of triangles}}{\text{total number of wedges}}. \qquad (1)$$

In Fig. 1, for example, (5)-(4)-(6) and (3)-(4)-(5) are two
wedges centered at (4). We say a wedge is closed if it is part
of a triangle; otherwise, we say the wedge is open. Thus,
(5)-(4)-(6) is an open wedge, while (3)-(4)-(5) is a closed wedge.
We can interpret $c$ as the probability that a random wedge
is closed.

Suppose we pick a sequence of $k$ random wedges, indexed by
$j = 1, \dots, k$; let $X_j$ be a random variable associated with
the $j$th random wedge such that $X_j = 1$ if the wedge is closed
and $X_j = 0$ if it is open. Define $\bar{X} = \frac{1}{k} \sum_{j=1}^{k} X_j$.
It is easy to see that

$$c = E[\bar{X}].$$

We show that $c$ can be estimated to very high accuracy by
sampling a constant number of wedges and checking if they
are closed. Specifically, we prove

$$\Pr\{ |\bar{X} - c| \geq \epsilon \} \leq \delta$$

using $k = \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$ samples. Note that the number
of samples does not depend on the size of the graph. For
instance, it requires fewer than 2,000 samples to have an
absolute error of 0.05 with 99.9% confidence; fewer than
40,000 samples are needed for an absolute error of 0.01 at
99.9% confidence.

Figure 1: Example graph with 12 wedges and 1 triangle.

This translates directly to an estimate of the number of
triangles, i.e., if we define $\hat{t} = \bar{X} p/3$, then

$$\Pr\{ |\hat{t} - t| \geq \epsilon p/3 \} \leq \delta.$$

Hence, we can bound the error in our estimate of the number
of triangles as a fraction $\epsilon$ of the total number of wedges with
confidence determined by $\delta$.

Through extensive numerical studies, we show that our 
proposed algorithm is much faster than direct enumeration 
and has less variance (and tighter bounds) than previously 
proposed approximation approaches. 

We also extend this basic premise to computing the degree- 
wise clustering coefficients and triangles as well as counting 
directed triangles in a directed graph. 

1.2 Related Work 

The enumeration algorithms for finding triangles are either
node- or edge-centric. The node-centric algorithm
iterates over all nodes and, for each node $v$, checks all pairs
among the neighbors of $v$ for being connected. The edge-centric
algorithm, on the other hand, goes over all edges
$(u, v)$ and seeks common neighbors of $u$ and $v$. Chiba and
Nishizeki [9] proposed a node-centric algorithm that orders
the vertices by degree and processes each edge only once,
by its lower degree vertex. They showed that this algorithm
runs in $O(m\alpha(G))$ time, where $m$ is the number of
edges and $\alpha(G)$ is the arboricity of the graph $G$ (arboricity
is defined as the minimum number of forests into which its
edges can be partitioned and can be considered as a measure
of how dense the graph is). Schank and Wagner [22]
used the same idea for their forward algorithm. Cohen [10]
and Suri and Vassilvitskii [25] independently proposed the
same idea. Latapy proved that the forward algorithm runs
in $O(m^{3/2})$ time and proposed improvements that reduce
the search space [17]. Latapy also showed that the runtime
of this algorithm becomes $O(mn^{1/\alpha})$ for graphs with power-law
degree distributions, where $\alpha$ is the power-law coefficient
and $n$ is the number of vertices [17]. More recently, Berry
et al. improved this bound to $O(m)$ when the power-law
coefficient is at least 7/3 [4].
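
As an illustration of the degree-ordered enumeration idea described above, the following Python sketch counts triangles exactly. It is not taken from any of the cited implementations; the function name `count_triangles_exact` and the convention that vertices are labeled 0 to n-1 are assumptions made for this example.

```python
def count_triangles_exact(n, edges):
    """Exact triangle count by degree-ordered enumeration (illustrative sketch).

    Assumes vertices are labeled 0..n-1 and `edges` holds undirected pairs.
    Self-loops and repeated edges are ignored.
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Rank vertices by (degree, index); the index acts as a tie-breaker.
    def rank(v):
        return (len(adj[v]), v)

    # Store each edge only at its lower-ranked endpoint.
    higher = [{w for w in adj[v] if rank(w) > rank(v)} for v in range(n)]

    triangles = 0
    for u in range(n):
        for v in higher[u]:
            # A common higher-ranked neighbor w closes the triangle u, v, w,
            # and each triangle is found exactly once (at its lowest-ranked vertex
            # paired with its middle-ranked vertex).
            triangles += len(higher[u] & higher[v])
    return triangles
```

For example, calling the function on the complete graph on four vertices should report its four triangles.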

To cope with ever-increasing data sizes, streaming
algorithms have been proposed to count the number of
triangles [2, 3, 13, 6]. The work by Buriol et al. [6] is particularly
important for this paper since their sampling strategy
is similar to what we use for estimating undirected triangles.
Despite the similar sampling approach, the error and confidence
bounds in the two studies are different, and their work
focuses only on the number of triangles, whereas we extend
this sampling approach to directed triangles and to the
distribution of triangles.

Another sampling-based approach was proposed by Tsourakakis
et al. [27]. Their algorithm, Doulion, reduces the
size of the graph by randomly sparsifying it. Specifically,
a smaller graph is constructed by keeping each edge
in the original graph with a given probability $p$. Then the
number of triangles in the original graph is estimated by
multiplying the number of triangles of the small graph by
$1/p^3$. The error bounds of this algorithm rely on two parameters
that we cannot know in advance. The first parameter
is the number of triangles, which is what we are trying to
compute; hence the algorithm offers little guidance about
the quality of an estimate or what would be a good $p$ to
use to achieve a desired error and confidence bound. The other is
the number of pairs of triangles that share an edge, which
points to a particular weakness of this algorithm. Consider
the graph in Fig. 2; the edge between vertices $u$ and $v$ will be
dropped with probability $1 - p$, removing all 4 triangles from
the graph. In practice, this causes large variations in
predictions. We compare against this method in our numerical
results.
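
The sparsify-and-rescale idea behind Doulion can be sketched in a few lines of Python. This is only an illustration of the description above, not the authors' or Tsourakakis et al.'s code; the names `exact_triangles`, `doulion_estimate`, and `keep_prob` (standing in for the edge probability $p$) are hypothetical.

```python
import random
from itertools import combinations


def exact_triangles(edges):
    """Count triangles in a simple undirected edge list (node-centric check)."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Every triangle contributes one closed wedge at each of its three vertices.
    closed = sum(1 for v in adj for a, b in combinations(adj[v], 2) if b in adj[a])
    return closed // 3


def doulion_estimate(edges, keep_prob, seed=None):
    """Keep each edge independently with probability keep_prob, count the
    triangles that survive, and rescale by 1 / keep_prob**3."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in (0, 1]")
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < keep_prob]
    return exact_triangles(kept) / keep_prob ** 3
```

With `keep_prob = 1.0` every edge survives and the estimate reduces to the exact count; smaller values trade accuracy for a smaller graph, and dropping a single heavily shared edge removes every triangle that contains it.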

Tsourakakis [26] and Avron [1] used the spectral properties
of the adjacency matrices of the graphs to approximate
the number of triangles; specifically, the number of triangles,
$t$, is equal to $t = \frac{1}{6} \sum_{i=1}^{n} \lambda_i^3$, where the $\lambda_i$'s are the
eigenvalues of the adjacency matrix. It may be possible to
estimate a few of the largest eigenvalues in order to give
an approximation to $t$. Finally, Suri and Vassilvitskii
proposed a MapReduce implementation for exact counting of
triangles [25].




Figure 2: Drawback of edge sampling to construct a smaller 
graph: Omission of edge (u, v) eliminates ten potential tri- 
angles because it is shared. 

2. PRELIMINARIES 

Our results derive from the following well-known result 
by Hoeffding on the accuracy of estimating the mean from 
a few random samples. We make no assumptions on the 
distribution of the random variables. 

Theorem 1 (Hoeffding [16]). Let $X_1, X_2, \dots, X_k$ be
independent random variables with $0 \leq X_i \leq 1$ for all
$i = 1, \dots, k$. Define $\bar{X} = \frac{1}{k} \sum_{i=1}^{k} X_i$. Let $\mu = E[\bar{X}]$. Then for
$\epsilon \in (0, 1 - \mu)$, we have

$$\Pr\{ |\bar{X} - \mu| \geq \epsilon \} \leq 2 \exp(-2k\epsilon^2).$$

Note that the requirement that $\epsilon < 1 - \mu$ is for convenience.
If $\epsilon > 1 - \mu$, then the implication is that $|\bar{X}| > 1$, which
violates the assumption of the theorem. In other words, if $\epsilon$
is too large, then the probability is zero. We use this more
convenient corollary in the proofs of our theorems.

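Corollary 2. For positive $\epsilon, \delta$, set $k = \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$.
Let $X_1, X_2, \dots, X_k$ be independent random variables with
$0 \leq X_i \leq 1$ for all $i = 1, \dots, k$. Define $\bar{X} = \frac{1}{k} \sum_{i=1}^{k} X_i$.
Let $\mu = E[\bar{X}]$. Then,

$$\Pr\{ |\bar{X} - \mu| \geq \epsilon \} \leq \delta.$$

Proof. Let $2 \exp(-2\epsilon^2 k) = \delta$, solve for $k$, and apply
Thm. 1. □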
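
The sample size in Cor. 2 is easy to compute directly. The short Python sketch below evaluates $k = \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$; the function name `hoeffding_sample_size` is an assumption made for this illustration.

```python
import math


def hoeffding_sample_size(eps, delta):
    """Smallest integer k with 2 * exp(-2 * k * eps**2) <= delta,
    i.e. k = ceil(0.5 * eps**-2 * ln(2 / delta))."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(0.5 * math.log(2.0 / delta) / eps ** 2)
```

For instance, `hoeffding_sample_size(0.01, 0.001)` evaluates to 38005. Because the sketch always rounds up, it may come out one larger than a quoted sample count that was rounded to the nearest integer.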

3. COUNTING TRIANGLES 

We first consider the problem of counting all triangles in 
an undirected graph. This is closely related to estimating 
the clustering coefficient. 

Our goal is to estimate $t$, the total number of triangles, in
an undirected graph $G = (V, E)$. Let $n = |V|$ and $m = |E|$.
Without loss of generality, assume the vertices are indexed
by $i = 1, \dots, n$. Let $d_i$ denote the degree of vertex $i$. The
number of wedges centered at node $i$ is given by

$$p_i = \binom{d_i}{2} = \frac{d_i (d_i - 1)}{2}.$$

Note that $\binom{1}{2} = \binom{0}{2} = 0$. Let $W$ denote the set of all wedges
in $G$. The total number of wedges is $p = |W| = \sum_i p_i$.
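
The wedge counts follow directly from the degrees. As a small illustration, the Python sketch below computes $p_i = d_i(d_i - 1)/2$ for every vertex and the total $p$; the function name `wedge_counts` and the 0-based vertex labels are assumptions of this example.

```python
def wedge_counts(n, edges):
    """Return ([p_0, ..., p_{n-1}], p) for an undirected graph on vertices 0..n-1.

    Self-loops and repeated edges are ignored so that degrees count
    distinct neighbors.
    """
    deg = [0] * n
    seen = set()
    for u, v in edges:
        key = (min(u, v), max(u, v))
        if u != v and key not in seen:
            seen.add(key)
            deg[u] += 1
            deg[v] += 1
    per_vertex = [d * (d - 1) // 2 for d in deg]  # binomial(d, 2)
    return per_vertex, sum(per_vertex)
```

On the complete graph with four vertices, each vertex has degree 3 and therefore 3 wedges, for a total of $p = 12$.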

We derive a result on the accuracy of estimating the clus- 
tering coefficient. 

Theorem 3 (Clustering Coefficient). For $\epsilon, \delta > 0$,
set $k = \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$. For $j = 1, \dots, k$, choose wedge $w_j$
uniformly at random (with replacement) from $W$ and let $X_j$
be defined as

$$X_j = \begin{cases} 1, & \text{if } w_j \text{ is closed}, \\ 0, & \text{otherwise}. \end{cases}$$

Define $\bar{X} = \frac{1}{k} \sum_{j=1}^{k} X_j$. Then

$$\Pr\{ |\bar{X} - c| \geq \epsilon \} \leq \delta,$$

where $c$ is the clustering coefficient defined in (1).

Proof. Recall that $c$ is the proportion of wedges that are
closed. Thus, it is straightforward to observe that $c = E[\bar{X}]$.
The proof then follows directly from Cor. 2. □

Corollary 4 (Counting Triangles). Let the conditions
of Thm. 3 hold. Define $\hat{t} = \bar{X} p/3$ and $\hat{\epsilon} = \epsilon p/3$. Then

$$\Pr\{ |\hat{t} - t| \geq \hat{\epsilon} \} \leq \delta.$$

Proof. Since $t = cp/3$ per (1), this corollary follows
immediately from Thm. 3. □

Observe that the number of samples, $k$, does not depend
on the size of the graph. We say that $\epsilon$ is the error and
$1 - \delta$ is the confidence. Fig. 3 shows the number of samples
needed for different error rates. We show three different
curves for different confidence levels. Increasing the confidence
has minimal impact on the number of samples. The
number of samples is fairly low for error rates of 0.1 or 0.01,
but it increases with the inverse square of the desired error.
Nonetheless, the roughly 3.8 million samples required for an error
rate of $\epsilon = 0.001$ at 99.9% confidence require only a few
seconds of calculations on most serial machines.




Figure 3: The number of samples needed for different error
rates (horizontal axis: error $\epsilon$) and different levels of
confidence. A few data points at 99.9% confidence are highlighted.



Algorithm 1 Hoeffding Triangle Estimate
Given error $\epsilon$ and failure probability $\delta$.

 1: Calculate degree $d_i$ for $i = 1, \dots, n$
 2: Set $p_i = d_i(d_i - 1)/2$ for $i = 1, \dots, n$, and $p = \sum_i p_i$
 3: Set $z_1 = 1$ and $z_{i+1} = 2 \sum_{i'=1}^{i} p_{i'} + 1$ for $i = 1, \dots, n$
 4: $k \leftarrow \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$
 5: cnt $\leftarrow 0$
 6: for $j = 1, \dots, k$ do
 7:     $r \leftarrow$ Uniform[$\{1, \dots, 2p\}$]
 8:     Find $i$ such that $z_i \leq r < z_{i+1}$.
 9:     $\ell' \leftarrow \lfloor (r - z_i)/(d_i - 1) \rfloor + 1$
10:     $\ell'' \leftarrow (r - z_i) - (d_i - 1)(\ell' - 1) + 1$
11:     if $\ell'' \geq \ell'$ then
12:         $\ell'' \leftarrow \ell'' + 1$
13:     end if
14:     $i' \leftarrow$ index of $\ell'$th neighbor of $i$
15:     $i'' \leftarrow$ index of $\ell''$th neighbor of $i$
16:     if $(i', i'') \in E$ then
17:         cnt $\leftarrow$ cnt $+ 1$
18:     end if
19: end for
20: $\hat{c} = \text{cnt}/k$                      ▷ Estimate for $c$
21: $\hat{t} = p \, (\frac{1}{3} \text{cnt})/k$   ▷ Estimate for $t$



We design an algorithm for estimating the clustering coefficient
and number of triangles and analyze its complexity.
The basic premise is to select a number of wedges uniformly
at random and check whether or not each is closed. There
are numerous ways that this can be implemented. For instance,
we can select vertex $i$ with probability equal to $p_i/p$
and then select two of its neighbors uniformly at random
without replacement. In this case, the overall probability of
selecting a particular wedge is $(p_i/p) \times 1/\binom{d_i}{2} = 1/p$. The
implementation we describe in Alg. 1 directly chooses a wedge
at random. However, we do not explicitly enumerate all
wedges. Instead, we have an implicit mapping of each random
number to a particular wedge. To make this work
cleanly, each wedge is actually listed twice, as (u)-(v)-(w) and
(w)-(v)-(u), and we will pick a random number in $\{1, \dots, 2p\}$.

We consider the algorithm in some detail. 

• Step 1 is actually the most expensive step, costing
$O(m)$ to calculate the degrees of all vertices.

• Step 2 just computes the number of wedges per vertex,
and $p$ is the total number of wedges.

• Step 3 calculates the edges of the "wedge bins" corresponding
to each vertex. For vertex $i$, its bin size is $2p_i$
since each wedge appears twice. If vertex $i$ has $p_i = 0$,
then it will never be selected because of the requirement that
$z_{i+1} > r$ (strict inequality).

• Steps 8-15 convert the random number $r$ selected
in Step 7 into an actual wedge. Note that Step 8
can be performed using a binary search at a cost of
$O(\log n)$. The cost of Steps 14 and 15 is $O(1)$ when
the standard adjacency list format is used to store the
graph.

• Step 16 requires checking the existence of an edge in
the graph. The cost of this is $O(\log m)$.

We may conclude that the total cost of the method is $O(m)$
for preprocessing the degree distributions and $O(k \log m)$ for
checking the closure of $k$ wedges.

We compare this method to exact enumeration and the 
Doulion method in §6. 
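
To make the steps of Alg. 1 concrete, here is a minimal Python sketch of the wedge-sampling estimator. It is an illustration under stated assumptions rather than the authors' C implementation: the function name `wedge_sample_estimate` is hypothetical, vertices are assumed to be labeled 0 to n-1, indices are 0-based (so the random number is drawn from 0 to 2p-1), and edge existence is checked with a hash set instead of a sorted adjacency list.

```python
import bisect
import math
import random
from itertools import accumulate


def wedge_sample_estimate(n, edges, eps, delta, seed=None):
    """Return (c_hat, t_hat): estimates of the clustering coefficient and
    the triangle count from uniformly sampled wedges."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    nbr = [sorted(a) for a in adj]

    # Each wedge is listed twice (once per ordering of its endpoints),
    # so the bin of vertex i has size d_i * (d_i - 1) = 2 * p_i.
    bin_size = [len(a) * (len(a) - 1) for a in nbr]
    z = [0] + list(accumulate(bin_size))  # z[i] = start of bin i, z[n] = 2p
    total = z[-1]
    if total == 0:
        return 0.0, 0.0  # no wedges, hence no triangles

    k = math.ceil(0.5 * math.log(2.0 / delta) / eps ** 2)
    closed = 0
    for _ in range(k):
        r = rng.randrange(total)
        i = bisect.bisect_right(z, r) - 1        # z[i] <= r < z[i + 1]
        first, second = divmod(r - z[i], len(nbr[i]) - 1)
        if second >= first:
            second += 1                          # skip the first endpoint
        u, w = nbr[i][first], nbr[i][second]
        if w in adj[u]:                          # is the wedge closed?
            closed += 1

    c_hat = closed / k
    t_hat = c_hat * (total // 2) / 3             # t = c * p / 3
    return c_hat, t_hat
```

On a complete graph every wedge is closed, so the sketch returns a clustering coefficient of exactly 1 regardless of the random draws; on a star graph every wedge is open and both estimates are 0.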



4. COUNTING TRIANGLES PER DEGREE 

Here we consider the problem of counting a subset of triangles
in a graph, i.e., all those that contain a node of some
specified degree. Likewise, we consider the problem of estimating
the clustering coefficient for all wedges centered at
nodes of specified degree. In this case, the estimated clustering
coefficient does not lead directly to an estimate for the
number of triangles. However, both use the same basic data.

The node-level clustering coefficient (first used in [28]) is

$$c_i = \frac{t_i}{p_i} = \frac{\text{number of triangles incident to node } i}{\text{number of wedges centered at node } i}.$$

The degree-wise clustering coefficient, $c_d$, is the average of
$c_i$ for nodes of degree $d$. Define $V_d = \{ i \in V \mid d_i = d \}$. Let
$n_d = |V_d|$. Then we can write $c_d$ as

$$c_d = \frac{1}{n_d} \sum_{i \in V_d} c_i. \qquad (2)$$

Let $W_d$ be the set of wedges centered at a node of degree
$d$. We partition $W_d$ into four disjoint subsets as follows:

$$W_{d,0} = \{ w \in W_d \mid w \text{ open} \},$$
$$W_{d,1} = \{ w \in W_d \mid w \text{ closed and has 1 degree-}d\text{ node} \},$$
$$W_{d,2} = \{ w \in W_d \mid w \text{ closed and has 2 degree-}d\text{ nodes} \},$$
$$W_{d,3} = \{ w \in W_d \mid w \text{ closed and has 3 degree-}d\text{ nodes} \}.$$

The total number of wedges centered at a degree-$d$ node is

$$p_d = |W_d| = n_d \binom{d}{2}.$$

Define $p_{d,q} = |W_{d,q}|$ for $q = 0, 1, 2, 3$. Clearly, $p_d = \sum_q p_{d,q}$.
It is easy to show that (2) can be rewritten as

$$c_d = \frac{p_{d,1} + p_{d,2} + p_{d,3}}{p_d}.$$

We are also interested in the number of triangles having one
or more degree-$d$ nodes, denoted by $t_d$. Observe that

$$t_d = p_{d,1} + \frac{1}{2} p_{d,2} + \frac{1}{3} p_{d,3}, \qquad (3)$$

since for each triangle there is either one wedge in $W_{d,1}$, two
wedges in $W_{d,2}$, or three wedges in $W_{d,3}$.

Theorem 5 (Degree-wise Clustering Coefficient).
For $\epsilon, \delta > 0$, set $k = \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$. For $j = 1, \dots, k$,
choose wedge $w_j$ uniformly at random (with replacement)
from $W_d$ and let $X_j$ be defined as

$$X_j = \begin{cases} 1, & \text{if } w_j \text{ is closed}, \\ 0, & \text{otherwise}. \end{cases}$$

Define $\bar{X} = \frac{1}{k} \sum_{j=1}^{k} X_j$. Then

$$\Pr\{ |\bar{X} - c_d| \geq \epsilon \} \leq \delta,$$

where $c_d$ is the degree-wise clustering coefficient from (2).

Proof. Observe that $c_d = E[\bar{X}]$ since it is the probability
that a random wedge in $W_d$ is closed. The proof follows
immediately from Cor. 2. □



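Algorithm 2 Hoeffding Degree-d Triangle Estimate
Given set of degrees $D$, error $\epsilon$, and failure probability $\delta$.

 1: Calculate degree $d_i$ for $i = 1, \dots, n$
 2: Set $p_i = d_i(d_i - 1)/2$ if $d_i \in D$, and $p_i = 0$ otherwise, for $i = 1, \dots, n$
 3: Set $p_D = \sum_i p_i$.
 4: Set $z_1 = 1$ and $z_{i+1} = 2 \sum_{i'=1}^{i} p_{i'} + 1$ for $i = 1, \dots, n$
 5: $k \leftarrow \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$
 6: cnt1 $\leftarrow 0$, cnt2 $\leftarrow 0$, cnt3 $\leftarrow 0$
 7: for $j = 1, \dots, k$ do
 8:     $r \leftarrow$ Uniform[$\{1, \dots, 2p_D\}$]
 9:     Find $i$ such that $z_i \leq r < z_{i+1}$.
10:     $\ell' \leftarrow \lfloor (r - z_i)/(d_i - 1) \rfloor + 1$
11:     $\ell'' \leftarrow (r - z_i) - (d_i - 1)(\ell' - 1) + 1$
12:     if $\ell'' \geq \ell'$ then
13:         $\ell'' \leftarrow \ell'' + 1$
14:     end if
15:     $i' \leftarrow$ index of $\ell'$th neighbor of $i$
16:     $i'' \leftarrow$ index of $\ell''$th neighbor of $i$
17:     if $(i', i'') \in E$ then
18:         if $d_{i'} \in D$ and $d_{i''} \in D$ then
19:             cnt3 $\leftarrow$ cnt3 $+ 1$
20:         else if $d_{i'} \in D$ or $d_{i''} \in D$ then
21:             cnt2 $\leftarrow$ cnt2 $+ 1$
22:         else
23:             cnt1 $\leftarrow$ cnt1 $+ 1$
24:         end if
25:     end if
26: end for
27: $\hat{c}_D = (\text{cnt1} + \text{cnt2} + \text{cnt3})/k$                                        ▷ Estimate for $c_D$
28: $\hat{t}_D = p_D (\text{cnt1} + \frac{1}{2}\text{cnt2} + \frac{1}{3}\text{cnt3})/k$   ▷ Estimate for $t_D$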



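Theorem 6 (Degree-wise Triangle Count). For
$\epsilon, \delta > 0$, set $k = \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$. For $j = 1, \dots, k$, choose
wedge $w_j$ uniformly at random (with replacement) from $W_d$
and let $Y_j$ be defined as

$$Y_j = \begin{cases} 1, & \text{if } w_j \in W_{d,1}, \\ \frac{1}{2}, & \text{if } w_j \in W_{d,2}, \\ \frac{1}{3}, & \text{if } w_j \in W_{d,3}, \\ 0, & \text{if } w_j \in W_{d,0} \text{ (open)}. \end{cases}$$

Let $\bar{Y} = \frac{1}{k} \sum_{j=1}^{k} Y_j$. Define $\hat{t}_d = \bar{Y} \cdot p_d$ and $\hat{\epsilon} = \epsilon\, p_d$. Then

$$\Pr\{ |\hat{t}_d - t_d| \geq \hat{\epsilon} \} \leq \delta,$$

where $t_d$ is the number of triangles having one or more vertices
of degree $d$.

Proof. We claim $E[\bar{Y}] = E[Y_j] = t_d/p_d$. Suppose that $w$
is selected from $W_d$ uniformly at random. Observe that

$$E[Y_j] = \Pr\{ w \in W_{d,1} \} + \frac{1}{2} \Pr\{ w \in W_{d,2} \} + \frac{1}{3} \Pr\{ w \in W_{d,3} \}$$
$$= \frac{p_{d,1}}{p_d} + \frac{1}{2} \cdot \frac{p_{d,2}}{p_d} + \frac{1}{3} \cdot \frac{p_{d,3}}{p_d} = t_d/p_d,$$

per (3). Hence, from Cor. 2 we have

$$\Pr\{ |\bar{Y} - t_d/p_d| \geq \epsilon \} \leq \delta,$$

and the theorem follows by multiplying the inequality by
$p_d$. □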

The algorithm to compute the degree-wise clustering coefficient
and triangle count is shown, in essence, in Alg. 2. We
have generalized the idea here for any set of specified degrees
$D \subseteq \{ 1, \dots, d_{\max} \}$. If $D = \{ 1, \dots, d_{\max} \}$, it is easy to see
that this is equivalent to Alg. 1. There are three counts
corresponding to the number of closed wedges with 1, 2, and 3
vertices with degrees in $D$, respectively. If we are only interested in
the clustering coefficient, then there is no need to split the
counts. In Step 3, we define $p_D$ to be the number of wedges
with a node of degree $d \in D$ at their center. Similarly, we
define $c_D$ to be the average of all $c_i$ such that $d_i \in D$ and
$t_D$ to be the number of triangles with at least one vertex $i$
such that $d_i \in D$.
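
The degree-restricted estimator can be sketched in Python along the same lines as Alg. 2. As before, this is an illustrative sketch and not the authors' implementation: the function name `degreewise_estimate`, the 0-based vertex labels, and the use of a hash set for the edge check are assumptions of this example.

```python
import bisect
import math
import random
from itertools import accumulate


def degreewise_estimate(n, edges, degrees, eps, delta, seed=None):
    """Return (c_hat, t_hat) for wedges centered at vertices whose degree
    lies in the set `degrees`: the estimated clustering coefficient and the
    estimated number of triangles with at least one such vertex."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    nbr = [sorted(a) for a in adj]
    deg = [len(a) for a in nbr]
    in_d = [d in degrees for d in deg]

    # Only vertices with degree in D get a non-empty bin of ordered wedges.
    bin_size = [deg[i] * (deg[i] - 1) if in_d[i] else 0 for i in range(n)]
    z = [0] + list(accumulate(bin_size))
    p_d = z[-1] // 2
    if p_d == 0:
        return 0.0, 0.0

    k = math.ceil(0.5 * math.log(2.0 / delta) / eps ** 2)
    cnt = [0, 0, 0, 0]  # cnt[q]: closed wedges with q vertices of degree in D
    for _ in range(k):
        r = rng.randrange(z[-1])
        i = bisect.bisect_right(z, r) - 1
        first, second = divmod(r - z[i], deg[i] - 1)
        if second >= first:
            second += 1
        u, w = nbr[i][first], nbr[i][second]
        if w in adj[u]:
            # The center always has degree in D; add the qualifying endpoints.
            cnt[1 + in_d[u] + in_d[w]] += 1

    c_hat = (cnt[1] + cnt[2] + cnt[3]) / k
    t_hat = p_d * (cnt[1] + cnt[2] / 2 + cnt[3] / 3) / k
    return c_hat, t_hat
```

The weights 1, 1/2, and 1/3 compensate for the fact that a triangle with two or three qualifying vertices is reachable from two or three different sampled wedges.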

5. COUNTING DIRECTED TRIANGLES 

Counting triangles in directed graphs is considerably more 
difficult because there are seven types of directed triangles 
(up to isomorphism); see Fig. 4. Nonetheless, the same prin- 
ciples apply. 




Figure 4: All different directed triangles 

For directed graphs, there are three types of edges: outward,
inward, and bidirectional. The number of outward,
inward, and bidirectional edges incident to vertex $i$ is called
the out-degree, in-degree, and bi-degree, respectively. These
are denoted by $d_i^{+}$, $d_i^{-}$, and $d_i^{*}$.

Given these three edge types, there are six different types
of wedges, labeled by lowercase Roman numerals in Fig. 5.
For any wedge type $\gamma \in \Gamma = \{ \text{i}, \text{ii}, \text{iii}, \text{iv}, \text{v}, \text{vi} \}$, define

$$p_i(\gamma) = \text{number of wedges of type } \gamma \text{ centered at node } i, \text{ and}$$

$$p(\gamma) = \sum_{i=1}^{n} p_i(\gamma) = \text{number of wedges of type } \gamma.$$

The formulas for calculating $p_i(\gamma)$ are given in Tab. 2.




Figure 5: All different directed wedges 



$\gamma$         i                     ii               iii                   iv               v                vi
$p_i(\gamma)$    $\binom{d_i^+}{2}$    $d_i^+ d_i^-$    $\binom{d_i^-}{2}$    $d_i^* d_i^+$    $d_i^* d_i^-$    $\binom{d_i^*}{2}$

Table 2: Number of wedges per node for each wedge type

Finally, we come back to the seven different types of triangles,
labeled by lowercase letters in Fig. 4. For any triangle
type $\sigma \in S = \{ a, b, c, d, e, f, g \}$, define

$$t(\sigma) = \text{number of triangles of type } \sigma.$$

We let $\omega(\gamma, \sigma)$ be the number of wedges of type $\gamma$ in a
triangle of type $\sigma$. These values are listed in Tab. 3. We also
define $\Gamma_\sigma = \{ \gamma \in \Gamma \mid \omega(\gamma, \sigma) > 0 \}$, i.e., the subset of wedge types
participating in triangle type $\sigma$.

Triangle    Wedge types ($\gamma$)
type        i     ii    iii   iv    v     vi
a           1     1     1
b                 3
c           1                       2
d                 1           1     1
e                       1     2
f                             1     1     1
g                                         3

Table 3: Number of each wedge type per triangle type

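We define $W(\gamma)$ to be the set of wedges of type $\gamma$. We
partition it into eight subsets as follows. Let

$$W(\gamma, 0) = \{ w \in W(\gamma) \mid w \text{ is open} \},$$
$$W(\gamma, \sigma) = \{ w \in W(\gamma) \mid w \text{ closes to be of type } \sigma \}.$$

Then we can write the number of triangles of type $\sigma$ as

$$t(\sigma) = \frac{|W(\gamma, \sigma)|}{\omega(\gamma, \sigma)} \quad \text{for any } \gamma \in \Gamma_\sigma.$$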



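Theorem 7 (Directed Triangle Count). Assume we
wish to count triangles of type $\sigma$. Choose $\gamma \in \Gamma_\sigma$. For $\epsilon, \delta > 0$,
set $k = \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$. For $j = 1, \dots, k$, choose wedge
$w_j$ uniformly at random (with replacement) from $W(\gamma)$ and
let $X_j$ be defined as

$$X_j = \begin{cases} 1, & \text{if } w_j \text{ closes to form a triangle of type } \sigma, \\ 0, & \text{otherwise}. \end{cases}$$

Define $\bar{X} = \frac{1}{k} \sum_{j=1}^{k} X_j$. Define $\hat{t} = \bar{X} p(\gamma)/\omega(\gamma, \sigma)$ and
$\hat{\epsilon} = \epsilon\, p(\gamma)/\omega(\gamma, \sigma)$. Then

$$\Pr\{ |\hat{t} - t(\sigma)| \geq \hat{\epsilon} \} \leq \delta.$$

The proof follows the same principles as the previous theorems
and so is omitted.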

We will not write down the full algorithm for all scenarios
because it is too complex to be easily represented in pseudocode.
Instead, we focus on triangle type (d) and wedge
type (ii) as a representative. This algorithm is presented in
Alg. 3. In Step 1, we calculate just the in- and out-degrees,
but we omit the bi-degrees since they are not used explicitly
for finding this triangle type. Recall, however, that these
in- and out-degree counts exclude any bidirectional edges.
In general, we recommend choosing the wedge type $\gamma \in \Gamma_\sigma$ with
the lowest total wedge count so that the sampling will visit
a larger fraction of the set, but this specific choice is not
necessary from a theoretical point of view.
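
The type (d) estimator with wedge type (ii) can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function name `type_d_estimate` and the representation of the graph as a collection of directed pairs `(u, v)` over vertices 0 to n-1 are assumptions of this example. A pair of opposite arcs is treated as one bidirectional edge and excluded from the in- and out-neighbor lists.

```python
import bisect
import math
import random
from itertools import accumulate


def type_d_estimate(n, arcs, eps, delta, seed=None):
    """Estimate the number of triangles in which the sampled center has one
    outward and one inward edge and the two endpoints share a bidirectional edge."""
    rng = random.Random(seed)
    arcset = {(u, v) for u, v in arcs if u != v}

    # Out- and in-neighbors exclude bidirectional edges.
    out_only = [[] for _ in range(n)]
    in_only = [[] for _ in range(n)]
    for u, v in sorted(arcset):
        if (v, u) not in arcset:
            out_only[u].append(v)
            in_only[v].append(u)

    # Number of (out-neighbor, in-neighbor) wedges centered at each vertex.
    bin_size = [len(out_only[i]) * len(in_only[i]) for i in range(n)]
    z = [0] + list(accumulate(bin_size))
    p = z[-1]
    if p == 0:
        return 0.0

    k = math.ceil(0.5 * math.log(2.0 / delta) / eps ** 2)
    cnt = 0
    for _ in range(k):
        r = rng.randrange(p)
        i = bisect.bisect_right(z, r) - 1        # z[i] <= r < z[i + 1]
        a, b = divmod(r - z[i], len(in_only[i]))
        u, w = out_only[i][a], in_only[i][b]
        # The wedge closes to this triangle type only if the two endpoints
        # are joined by a bidirectional edge.
        if (u, w) in arcset and (w, u) in arcset:
            cnt += 1
    return p * cnt / k
```

Because each wedge here has distinct kinds of endpoints (one out-neighbor and one in-neighbor), the wedges are not listed twice and the random number ranges over $p$ values rather than $2p$.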

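Algorithm 3 Hoeffding Type (d) Triangle Estimate
Given error $\epsilon$ and failure probability $\delta$.

 1: Calculate degrees $d_i^{+}$ and $d_i^{-}$ for $i = 1, \dots, n$
 2: Set $p_i = d_i^{+} d_i^{-}$ for $i = 1, \dots, n$, and $p = \sum_i p_i$
 3: Set $z_1 = 1$ and $z_{i+1} = \sum_{i'=1}^{i} p_{i'} + 1$ for $i = 1, \dots, n$
 4: $k \leftarrow \lceil 0.5\, \epsilon^{-2} \ln(2/\delta) \rceil$
 5: cnt $\leftarrow 0$
 6: for $j = 1, \dots, k$ do
 7:     $r \leftarrow$ Uniform[$\{1, \dots, p\}$]
 8:     Find $i$ such that $z_i \leq r < z_{i+1}$.
 9:     $\ell' \leftarrow \lfloor (r - z_i)/d_i^{-} \rfloor + 1$
10:     $\ell'' \leftarrow (r - z_i) - d_i^{-}(\ell' - 1) + 1$
11:     $i' \leftarrow$ index of $\ell'$th out-neighbor of $i$
12:     $i'' \leftarrow$ index of $\ell''$th in-neighbor of $i$
13:     if $(i', i'') \in E$ and $(i'', i') \in E$ then
14:         cnt $\leftarrow$ cnt $+ 1$
15:     end if
16: end for
17: $\hat{t} = p \cdot \text{cnt} / k$   ▷ Estimate for $t$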



6. EXPERIMENTAL RESULTS 
6.1 Experimental Setup 

We have implemented all our algorithms in C and have
run our experiments on a computer equipped with a 2.3GHz
Intel Core i7 processor with 4 cores, 256KB L2 cache
(per core), 8MB L3 cache, and 8GB of memory. We have
performed our experiments on 21 graphs chosen out of the
SNAP data set [30]. In all cases, edge weights and self-edges
are omitted. For the undirected tests, we ignore direction
on the edges. From this collection, we have chosen graphs
with a higher number of triangles. The properties of these
graphs are presented in Tab. 4.

Table 4: Comparison of triangle counting schemes for undirected graphs. We compare Doulion (D10 and D25) and our
Hoeffding (H) approach along with an efficient method for full enumeration (E). The errors are reported as the percentage of
$p/3$ (the maximum number of possible triangles).



Below, we compare our algorithms with the forward (enumeration)
algorithm [9, 22, 10, 25] and the Doulion approach [27].
For the forward algorithm, we have ordered vertices according
to their degrees and used the vertex numbering as a
tie-breaker. For the Doulion approach, we have used the
forward algorithm after down-selecting the edges.

6.2 Counting Triangles 

In Tab. 4, we summarize experiments on 21 graphs from
the SNAP collection [30]. Recall that $n$ is the number of
vertices, $m$ is the number of edges, $p$ is the number of
wedges, and $t$ is the number of triangles. We compare the
following methods:

• Enumeration (E) - Enumerates all triangles, taking
care to look at only one wedge per triangle rather
than three [9, 22, 10, 25].

• Doulion (D) - Estimates the number of triangles by
working with a reduced graph; edges are selected from
the original graph with probability $p$ [27] (here $p$ denotes
the edge-retention probability, not the number of wedges).
We use $p = 1/25$ (labeled D25) and $p = 1/10$ (labeled D10).

• Hoeffding (H) - This is our proposed approach. Here
we have used $k$ = 26,500 samples, corresponding to an
error of $\epsilon = 0.01$ at 99% confidence. This means we
expect the difference between our estimate and the real
answer to be no more than 1% of $p/3$.

Note that the enumeration approach gives the true number
of triangles ($t$). We show the estimate $\hat{t}$, computed by
each approximation method, as well as the error, which is
shown as a percentage of 1/3 of the total number of wedges
(the maximum number of possible triangles if every wedge
were closed), i.e.,

$$\text{error} = 100\, |\hat{t} - t| / (p/3).$$

For Hoeffding, we expect

$$\text{error} = 100\, |\hat{t} - t| / (p/3) \leq 100\epsilon = 1,$$

with 99% confidence. Indeed, the maximum error is 0.49,
well under the bound. As expected, D10 is generally better
than D25 (due to high variance, it is occasionally worse)
since it uses a larger sample of the graph. Hoeffding is
generally as good as or better than Doulion. On average, the error
of Doulion is much larger than that of Hoeffding. We could
use a higher value of $p$ in Doulion and retain more edges, but
then it would take more time.

The timing comparisons are also shown. It is worth not- 
ing that 90-99% of the time for the Doulion and Hoeffding 
methods is just reading the graph. Nevertheless, our objec- 
tive is to show that the Hoeffding method is at least as fast as 
the Doulion methods while achieving better accuracy. For 
graphs with a large number of wedges (e.g., as-skitter), 
the estimation methods are an order of magnitude faster 
than direct enumeration. 

The clustering coefficient is directly proportional to the 
number of triangles, so we do not include it in Tab. 4. 

Fig. 6 shows the convergence of the clustering coefficient 
estimate as the number of samples increases. The dashed 
line shows the error bars at 99.9% confidence. Indeed, it is 
always possible to increase the number of samples, adding 
to those already completed, in order to further reduce the 
error bound. Given the number of samples computed and 
the desired confidence, it is possible to determine the error 
bars, as we show here. The level of confidence does not 
change them much. 
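
The error bars follow from inverting the bound in Thm. 1: setting $2\exp(-2k\epsilon^2) = \delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\ln(2/\delta)/(2k)}$. The Python sketch below evaluates this; the function name `hoeffding_error_bar` is an assumption made for this illustration.

```python
import math


def hoeffding_error_bar(k, delta):
    """Half-width eps of the Hoeffding interval after k samples at
    confidence 1 - delta: eps = sqrt(ln(2 / delta) / (2 * k))."""
    if k <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need k > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * k))
```

Since the half-width scales as $1/\sqrt{k}$, quadrupling the number of samples halves the error bar, while the dependence on $\delta$ is only logarithmic, which is why changing the confidence level has little effect.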

6.3 Counting Triangles per Degree 

One of the unique benefits of our approach is the derivation
of a method to count only triangles with a specified
degree as well as the clustering coefficient for a specified
degree. For instance, the BTER model of [23] can accurately
capture the degree-wise clustering coefficients, but these are
prohibitively expensive to compute for large graphs because
doing so requires enumerating all triangles.

In Fig. 7, we compare true and predicted clustering coefficients
by degree. We use just $k$ = 6,622 samples per
degree, since that gives an absolute accuracy of $\epsilon = 0.02$
with 99% confidence.

[Plot: true value, Hoeffding estimate, and error at 99.9% confidence versus number of samples ($\times 10^4$).]

Figure 6: Convergence of clustering coefficient estimate as
the number of samples increases for the amazon0505 graph.

Figure 7: True and estimated clustering coefficients, using our Hoeffding algorithm with $\epsilon = 0.02$, $\delta = 0.01$, and $k$ = 6,622.

In Tab. 5, we compare our predictions of triangles by degree
with the actual counts computed by enumeration. We
show the predicted error range for the Hoeffding estimate,
and the actual difference (shown in the last column) is always
well within the bound. A nice feature of our algorithm
is that it can be adapted to any set of degrees, so
we show the set of triangles that have at least one vertex
with degree in the set {3, 4, 5}.

6.4 Counting Directed Triangles 

In Tab. 6, we show the results of our method for counting
directed triangles. We specify the number of directed edges,
which may be different than the undirected versions considered
in Tab. 4. For the directed triangles, we consider only
the Type (d) triangle (see Fig. 4). We use $k$ = 3,800,451
samples, corresponding to $\epsilon = 0.001$ and $\delta = 0.001$ (99.9%
confidence). We have only tested this for relatively small
graphs for which we can also do direct enumeration to compare
the results. Observe that the estimates are typically
an order of magnitude more accurate than the estimated
bounds (which are not unreasonable in the first place).

Table 5: Triangles by Degree, using $\epsilon = 0.02$ and
$k$ = 6,622 samples.

Table 6: Count of directed triangles of type (d).



7. CONCLUSIONS 

We have developed a novel approach to very fast estima- 
tion of the number of triangles in a graph. The approach 
is premised on sampling wedges and using Hoeffding's in- 
equality (Thm. 1) to bound the estimation error. The bulk 
of the work for our Hoeffding method is the preprocessing 
to determine the degree (or in-, out-, and bi-degree for di- 
rected graphs) of each vertex. From these values, we can 
directly calculate the total number of wedges (or each di- 
rected wedge) and from that compute exact error bounds 
for estimating the number of triangles. 

In our experimental results, we showed that our Hoeffding
estimation approach is more accurate than the Doulion
method and at least as fast in terms of computation time.
We have also shown that it is extremely accurate in terms
of counting the number of triangles of specified degree or for
calculating degree-wise clustering coefficients. To the best
of our knowledge, ours is the first estimation method for
calculating counts of directed triangles, and it is extremely
accurate in our experiments.

A major advantage of our Hoeffding method is that it 
can be easily implemented in a distributed framework. In 
a Hadoop MapReduce framework, for example, we may as- 
sume that every node knows its neighbors (this can be done 
but is a little more complicated when the neighbor list is 
too big to fit in a single mapper) and so can randomly select 
some wedges to check for closure. If the list of wedges to 
check is small (which will generally be the case), the dis- 
tributed cache can be employed and mappers can check for 
wedge closure. One can also consider checking for closure 
in the reducers, but it causes considerably more message
traffic.